Accurately identifying hemagglutinin using sequence information and machine learning methods

Introduction Hemagglutinin (HA) facilitates viral entry and infection by promoting fusion between the host membrane and the virus. Given its significance in influenza virus infection, HA has attracted attention as a target for influenza drug and vaccine development. Accurately identifying HA is therefore crucial for the development of targeted vaccines and drugs. However, in-silico methods for identifying HA are still lacking. This study aims to design a computational model to identify HA. Methods In this study, a benchmark dataset comprising 106 HA and 106 non-HA sequences was obtained from UniProt. Various sequence-based features were used to formulate samples. By performing feature optimization and feeding the optimized features into four kinds of machine learning methods, we constructed an integrated classifier model using the stacking algorithm. Results and discussion The model achieved an accuracy of 95.85% and an area under the receiver operating characteristic (ROC) curve of 0.9863 in 5-fold cross-validation. In the independent test, the model achieved an accuracy of 93.18% and an area under the ROC curve of 0.9793. The code is available at https://github.com/Zouxidan/HA_predict.git. The proposed model shows excellent prediction performance and should be a convenient tool for biochemical researchers studying HA.


Introduction
Influenza is a contagious respiratory disease, posing a significant threat to human health and causing varying degrees of disease burden globally (1,2). Hemagglutinin (HA), a glycoprotein on the surface of influenza viruses, mediates viral entry and infection by binding to host sialic acid receptors (3). The highly conserved stem or stalk region of HA has been identified as a promising target for the development of a universal influenza vaccine (4). Accurate identification of HA is crucial for targeted vaccine and drug development.
Zou et al., Frontiers in Medicine, 10.3389/fmed.2023.1281880, frontiersin.org
With the increasing maturity of protein sequence encoding methods and machine learning algorithms, sequence-based protein recognition has become an effective approach for the rapid identification of proteins. It classifies and identifies specific proteins using protein sequence encoding methods and machine learning algorithms and, because of its high recognition accuracy, has been widely used in prediction studies of cell-penetrating peptides (5), hemolytic peptides (6), anticancer peptides (7), hormone proteins (8), autophagy proteins (9), and anti-CRISPR proteins (10). Despite the pivotal role of HA in influenza virus infection, existing machine learning-based research on HA has primarily focused on influenza virus subtype classification (11,12), influenza virus host prediction (13), influenza virus mutation and evolution prediction (14), HA structure-function analysis (15), and influenza virus pathogenicity and prevalence prediction (16). However, there are currently no approaches for HA identification based on HA sequence information and machine learning techniques.
In this study, we proposed a machine learning-based prediction model for HA to achieve effective identification. First, we constructed a benchmark dataset based on existing protein databases. Next, we employed feature extraction methods to encode the protein sequences. Subsequently, we fused all the extracted features and used analysis of variance (ANOVA) combined with incremental feature selection (IFS) to obtain the most informative feature subset. Finally, the HA prediction model was developed based on this optimal feature subset. The workflow is shown in Figure 1.

Benchmark dataset
A benchmark dataset is essential for bioinformatics analysis (17,18). The dataset used in this study was collected from the Universal Protein Resource (UniProt) (19). To ensure the quality of the dataset, several pre-processing steps were performed. Protein sequences containing nonstandard letters (e.g., 'B', 'U', 'X', 'Z') were eliminated. Redundancy was removed using CD-HIT (20) with a cutoff value of 80%, so that sequences with a similarity higher than 80% were removed. The non-HA dataset was down-sampled to obtain a balanced dataset with equal numbers of positive and negative samples. The final benchmark dataset consisted of 212 protein sequences, including 106 HA and 106 non-HA samples. The dataset was randomly split into a training dataset and a test dataset in a 4:1 ratio. The training and test sets are available at https://github.com/Zouxidan/HA_predict.git, together with a dataset named 'predict_data.txt' for testing.
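The filtering, down-sampling, and splitting steps described above can be sketched as follows. This is a minimal illustration rather than the authors' script: the function names `is_standard` and `build_benchmark` are hypothetical, the CD-HIT redundancy step is assumed to have been run separately, and the sketch down-samples whichever class is larger (in the study this was the non-HA class).

```python
import random

STANDARD = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids


def is_standard(seq):
    """True if the sequence is non-empty and uses only standard residues."""
    return bool(seq) and set(seq.upper()) <= STANDARD


def build_benchmark(pos, neg, test_fraction=0.2, seed=0):
    """Filter, balance, and split sequences (hypothetical helper).

    pos / neg are lists of sequences assumed to have already passed
    redundancy removal (e.g. with CD-HIT, which is not reproduced here).
    Returns (train, test) lists of (sequence, label) pairs.
    """
    rng = random.Random(seed)
    pos = [s for s in pos if is_standard(s)]
    neg = [s for s in neg if is_standard(s)]
    # Down-sample the larger class so both classes have equal size.
    n = min(len(pos), len(neg))
    pos, neg = rng.sample(pos, n), rng.sample(neg, n)
    data = [(s, 1) for s in pos] + [(s, 0) for s in neg]
    rng.shuffle(data)
    # A test fraction of 0.2 corresponds to the 4:1 train/test ratio.
    n_test = round(len(data) * test_fraction)
    return data[n_test:], data[:n_test]
```

A plain random split is used here; a stratified split would keep the class balance identical in both subsets, which the text does not specify.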

Feature extraction
Feature extraction plays a crucial role in protein identification and prediction (10, 21–25). However, machine learning algorithms cannot directly process protein sequence information for computation and model construction. Therefore, it is necessary to convert protein sequence information into numerical data that can be understood and used by machine learning algorithms (26–29). Here, we employed various methods for feature extraction of protein sequences, including Amino Acid Composition (AAC), Dipeptide Composition (DPC), Tripeptide Composition (TPC), Composition of k-spaced Amino Acid Pairs (CKSAAP), Pseudo-Amino Acid Composition (PseAAC), and PseAAC of Distance-Pairs and Reduced Alphabet (DCP). These sequence feature extraction approaches have been widely adopted in the field of bioinformatics (30–32). The implementation of these feature extraction methods was based on iLearnPlus (33).
A protein sequence P of length L can be represented as:

$$P = R_1 R_2 R_3 \cdots R_L$$

where $R_1$ denotes the first amino acid of the sequence, $R_2$ denotes the second amino acid, and so on.

FIGURE 1
Workflow diagram for constructing the HA prediction model.

AAC
AAC is a commonly used method for protein sequence feature extraction that yields a 20-dimensional feature vector. AAC was defined as:

$$AAC(a_i) = \frac{N(a_i)}{L}, \quad i = 1, 2, \ldots, 20$$

where $a_i$ denotes the i-th natural amino acid and $N(a_i)$ denotes the number of occurrences of amino acid $a_i$ in the protein sequence.

DPC
Similar to AAC, DPC counts amino acid frequencies, but it focuses on pairs of adjacent amino acids in a protein sequence. DPC was defined as:

$$DPC(a_i, a_j) = \frac{N(a_i, a_j)}{L - 1}, \quad i, j = 1, 2, \ldots, 20$$

where $(a_i, a_j)$ denotes two adjacent amino acids and $N(a_i, a_j)$ denotes the number of occurrences of the amino acid pair $(a_i, a_j)$ in the protein sequence.

TPC
TPC is another feature extraction method that considers the relationship among three adjacent amino acids, providing more protein sequence information than AAC and DPC. TPC was defined as:

$$TPC(a_i, a_j, a_z) = \frac{N(a_i, a_j, a_z)}{L - 2}, \quad i, j, z = 1, 2, \ldots, 20$$

where $(a_i, a_j, a_z)$ denotes a combination of three adjacent amino acids, and $N(a_i, a_j, a_z)$ denotes the number of occurrences of the tripeptide $(a_i, a_j, a_z)$ in the protein sequence.
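AAC, DPC, and TPC are the k = 1, 2, and 3 cases of the same k-mer composition. The sketch below illustrates this under the assumption that each count is normalized by the number of sliding windows, L − k + 1; the function name `kmer_composition` is hypothetical, and this is not the iLearnPlus implementation used in the study.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_composition(seq, k):
    """Normalized frequency of every k-mer over the 20 standard residues.

    k=1 gives a 20-dim AAC vector, k=2 a 400-dim DPC vector and
    k=3 an 8,000-dim TPC vector, in lexicographic k-mer order.
    """
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    windows = len(seq) - k + 1
    if windows <= 0:
        raise ValueError("sequence is shorter than k")
    for i in range(windows):
        # Raises KeyError on nonstandard residues, which the benchmark
        # dataset is assumed to have filtered out beforehand.
        counts[seq[i:i + k]] += 1
    return [counts[m] / windows for m in kmers]
```

For example, `kmer_composition("ACAC", 2)` has three windows (AC, CA, AC), so the entry for "AC" is 2/3 and the entry for "CA" is 1/3.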

CKSAAP
To capture further sequence information, Chen et al. proposed CKSAAP (34), which was defined as:

$$CKSAAP(a_i, x_k, a_j) = \frac{N(a_i, x_k, a_j)}{L - k - 1}, \quad i, j = 1, 2, \ldots, 20$$

where k denotes the number of amino acids spaced between two amino acids, $x_k$ denotes k arbitrary amino acids, $(a_i, x_k, a_j)$ denotes the spaced amino acid pair, and $N(a_i, x_k, a_j)$ denotes the number of occurrences of the spaced amino acid pair $(a_i, x_k, a_j)$ in the protein sequence.
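A minimal sketch of CKSAAP follows, assuming counts are normalized by the number of available windows, L − k − 1, for each gap size k. The function name `cksaap` and the default `k_max=1` are assumptions: gap sizes 0 and 1 would produce 2 × 400 = 800 features, which is consistent with the 800-dimensional CKSAAP block reported for the fused feature set, but the gap setting is not stated in the text.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def cksaap(seq, k_max=1):
    """Composition of k-spaced amino acid pairs for k = 0 .. k_max.

    Returns 400 * (k_max + 1) values: for each gap k, the normalized
    frequency of every ordered residue pair separated by k residues.
    """
    pairs = [a + b for a in AMINO_ACIDS for b in AMINO_ACIDS]
    features = []
    for k in range(k_max + 1):
        windows = len(seq) - k - 1
        if windows <= 0:
            raise ValueError("sequence is too short for gap size %d" % k)
        counts = dict.fromkeys(pairs, 0)
        for i in range(windows):
            # Pair the residue at i with the one k positions further on,
            # ignoring the k residues in between.
            counts[seq[i] + seq[i + k + 1]] += 1
        features.extend(counts[p] / windows for p in pairs)
    return features
```

With k = 0 the descriptor reduces to DPC, so the first 400 values of `cksaap(seq)` match the dipeptide composition of the same sequence.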

PseAAC
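To incorporate protein sequence order information and improve prediction quality, a powerful feature called PseAAC was proposed, which incorporates the physicochemical characteristics of amino acids. PseAAC was defined as:

$$P_u = \begin{cases} \dfrac{x_u}{\sum_{i=1}^{20} x_i + \omega \sum_{j=1}^{\lambda} \theta_j}, & 1 \le u \le 20 \\ \dfrac{\omega\,\theta_{u-20}}{\sum_{i=1}^{20} x_i + \omega \sum_{j=1}^{\lambda} \theta_j}, & 21 \le u \le 20 + \lambda \end{cases}$$

where $x_i$ denotes the normalized amino acid frequency, $\omega$ denotes the weight factor for short-range and long-range effects, $\theta_j$ denotes the j-th sequence correlation factor, and $\lambda$ denotes the number of correlation factors.
$\theta_j$ was calculated as:

$$\theta_j = \frac{1}{L - j} \sum_{i=1}^{L-j} \Theta(R_i, R_{i+j}), \quad j = 1, 2, \ldots, \lambda$$

$\Theta(R_i, R_{i+j})$ was defined as:

$$\Theta(R_i, R_{i+j}) = \frac{1}{3} \left\{ \left[ H_1(R_{i+j}) - H_1(R_i) \right]^2 + \left[ H_2(R_{i+j}) - H_2(R_i) \right]^2 + \left[ M(R_{i+j}) - M(R_i) \right]^2 \right\}$$

where $H_1(R_i)$, $H_2(R_i)$, and $M(R_i)$ denote the standardized hydrophobicity, standardized hydrophilicity, and standardized side chain mass of the amino acid $R_i$, respectively.
The hydrophobicity, hydrophilicity, and side chain mass of amino acids were standardized over the 20 natural amino acids using the following equation, shown for hydrophobicity:

$$H_1(R_i) = \frac{H_1^0(R_i) - \overline{H_1^0}}{\sqrt{\frac{1}{20} \sum_{k=1}^{20} \left[ H_1^0(R_k) - \overline{H_1^0} \right]^2}}, \qquad \overline{H_1^0} = \frac{1}{20} \sum_{k=1}^{20} H_1^0(R_k)$$

with $H_2(R_i)$ and $M(R_i)$ obtained in the same way from $H_2^0(R_i)$ and $M^0(R_i)$. Here $H_1(R_i)$, $H_2(R_i)$, and $M(R_i)$ denote the standardized hydrophobicity, standardized hydrophilicity, and standardized side chain mass of amino acids, respectively, and $H_1^0(R_i)$, $H_2^0(R_i)$, and $M^0(R_i)$ denote the corresponding raw physicochemical properties of amino acids.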
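The PseAAC computation can be sketched as below, following the commonly used formulation in which the 20 composition terms and the weighted correlation factors share one normalizing denominator. Several things here are assumptions rather than details from the study: the function names, the defaults `lam=10` (which would give the 30 dimensions counted in the fused feature set) and `weight=0.05`, and the requirement that the caller supplies the raw property tables, since no physicochemical values are reproduced here.

```python
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def standardize(raw):
    """Scale a raw property table to zero mean and unit variance
    over the 20 standard amino acids."""
    values = [raw[a] for a in AMINO_ACIDS]
    mean = sum(values) / 20
    sd = sqrt(sum((v - mean) ** 2 for v in values) / 20)
    return {a: (raw[a] - mean) / sd for a in AMINO_ACIDS}


def pse_aac(seq, raw_properties, lam=10, weight=0.05):
    """Pseudo-amino acid composition with lam correlation factors.

    raw_properties is a list of dicts mapping each residue to a raw
    value (e.g. hydrophobicity, hydrophilicity, side chain mass).
    Returns 20 + lam values that sum to 1.
    """
    length = len(seq)
    if length <= lam:
        raise ValueError("sequence must be longer than lam")
    props = [standardize(p) for p in raw_properties]
    freqs = [seq.count(a) / length for a in AMINO_ACIDS]
    thetas = []
    for j in range(1, lam + 1):
        total = 0.0
        for i in range(length - j):
            # Mean squared property difference between residues i and i+j.
            total += sum((p[seq[i + j]] - p[seq[i]]) ** 2
                         for p in props) / len(props)
        thetas.append(total / (length - j))
    denom = sum(freqs) + weight * sum(thetas)
    return [f / denom for f in freqs] + [weight * t / denom for t in thetas]
```

Larger values of `weight` give the sequence-order terms more influence relative to the plain composition terms.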

DCP
To incorporate more protein sequence order information and reduce the impact of high-dimensional features, Liu et al. proposed DCP (35). Based on a validated amino acid simplification alphabet scheme (36), three simplified amino acid alphabets were defined. For any simplified amino acid alphabet, DCP was defined from the frequencies of amino acid cluster pairs, where z denotes the number of amino acid clusters in the simplified alphabet, and $N(cp_i, cp_j)$ denotes the frequency of any two amino acid clusters with distance d in the protein sequence.

Feature fusion and selection
Different feature extraction methods offer diverse interpretations and representations of protein sequences. Relying on a single feature extraction method limits the information available to the model. To obtain a more comprehensive and reliable interpretation of protein sequences, we fused all features to create a fused feature set, resulting in a 10,713-dimensional feature set (20 + 400 + 8,000 + 800 + 30 + 1,463). We then selected the optimal feature subset using ANOVA and IFS.
ANOVA, a widely used feature selection tool, tests the difference in means between groups to determine whether the independent variable influences the dependent variable. Its high accuracy has made it an effective choice for feature selection (8). For a feature f, its F-value was calculated based on the principle of ANOVA as follows:

$$F(f) = \frac{SSA / (K - 1)}{SSE / (N - K)}$$

where F(f) represents the F-value of feature f, SSA represents the sum of squares between groups, SSE represents the sum of squares within groups, and K − 1 and N − K denote the degrees of freedom between and within groups, respectively. N is the total number of samples, and K is the number of groups.
SSA and SSE were calculated as follows:

$$SSA = \sum_{i=1}^{K} k_i \left( \frac{1}{k_i} \sum_{j=1}^{k_i} f(i,j) - \frac{1}{N} \sum_{m=1}^{K} \sum_{n=1}^{k_m} f(m,n) \right)^2$$

$$SSE = \sum_{i=1}^{K} \sum_{j=1}^{k_i} \left( f(i,j) - \frac{1}{k_i} \sum_{n=1}^{k_i} f(i,n) \right)^2$$

where $f(i,j)$ denotes the value of feature f for the j-th sample of the i-th group, K represents the number of groups, and $k_i$ represents the total number of samples in the i-th group.
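As a worked illustration of the F-value, the sketch below computes the between-group and within-group sums of squares for one feature and takes the ratio of the two mean squares. The function name `anova_f` is hypothetical, and the input is simply a list of per-group value lists (here, the values of one feature for the HA and non-HA samples).

```python
def anova_f(groups):
    """One-way ANOVA F-value for a single feature.

    groups is a list of lists; each inner list holds the feature's
    values for the samples of one class.
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Between-group sum of squares (SSA).
    ssa = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-group sum of squares (SSE); zero for a feature that is
    # constant inside every group, in which case the ratio is undefined.
    sse = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ssa / (k - 1)) / (sse / (n - k))
```

For the two groups [1, 2, 3] and [4, 5, 6], the group means are 2 and 5 and the grand mean is 3.5, so SSA = 13.5, SSE = 4, and F = (13.5 / 1) / (4 / 4) = 13.5.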
A larger F-value indicates a stronger influence of the feature on data classification and therefore a greater contribution to the classification results. In a large feature set, redundant data and noise not only raise computational costs but can also cause overfitting or reduce the accuracy of the prediction model. The fused feature set contains 10,713 features. To save computational time and cost, we first used ANOVA as an initial filter to retain the 1,000 features with the greatest influence on the classification results.
Next, the optimal feature subset was determined by searching these top 1,000 features, ranked by F-value, using IFS. IFS is a frequently employed feature selection method in the field of bioinformatics (37,38). The specific process of IFS is as follows. First, all features were sorted in descending order according to their F-values obtained from ANOVA. Then, features were added to the feature set one at a time, and a support vector machine (SVM) model was constructed for each newly formed feature subset. Grid search was used to obtain optimal models, and their performance was evaluated using 5-fold cross-validation. The optimal feature subset was defined as the set of features that maximized the model's accuracy.
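The ANOVA ranking followed by IFS can be sketched with scikit-learn as below. This is a simplified illustration under stated assumptions, not the authors' code: the function name `incremental_feature_selection` is hypothetical, each subset is scored with an SVM at default hyperparameters instead of the per-subset grid search described above, and ties in accuracy are resolved in favor of the smaller subset.

```python
import numpy as np
from sklearn.feature_selection import f_classif
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC


def incremental_feature_selection(X, y, top_n=1000, cv=5):
    """Rank features by ANOVA F-value, then grow the subset one
    feature at a time and keep the size with the best CV accuracy."""
    f_values, _ = f_classif(X, y)
    # Constant features give NaN F-values; treat them as uninformative.
    f_values = np.nan_to_num(f_values, nan=0.0)
    ranked = np.argsort(f_values)[::-1][:top_n]
    best_acc, best_size = -1.0, 0
    for size in range(1, len(ranked) + 1):
        subset = ranked[:size]
        acc = cross_val_score(SVC(), X[:, subset], y,
                              cv=cv, scoring="accuracy").mean()
        if acc > best_acc:  # strict '>' keeps the smallest best subset
            best_acc, best_size = acc, size
    return ranked[:best_size], best_acc
```

Plotting the accuracy recorded at each subset size would give an IFS curve of the kind shown in Figure 2A. Wrapping the SVM in `GridSearchCV` would bring the sketch closer to the described procedure at a much higher computational cost.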

Machine learning methodology and modeling
The advancement of machine learning has provided an effective approach to solving biological problems (39–42). Using machine learning techniques to identify proteins based on sequence features has proven to be a rapid and widely applied method in various studies (43–45).
Constructing appropriate models is crucial for achieving accurate and robust predictions. In this study, we selected four commonly used machine learning algorithms, namely K-nearest neighbor (KNN) (46), logistic regression (LR) (47), random forest (RF) (48), and SVM (49), to build the basic classifier models for the HA dataset. The optimal parameters for each algorithm were obtained using grid search. To further enhance the model's accuracy and generalization ability, we developed an integrated classifier model by combining the four basic classifier models. The Stacking algorithm was employed, with logistic regression serving as the second-layer classifier. All the machine learning models used in this study were implemented using scikit-learn (50).
KNN is a simple yet effective machine learning algorithm that classifies a sample according to the distances between it and the other samples in the data. LR is a binary classification algorithm based on the sigmoid function, which classifies samples by their corresponding output values. In RF, the prediction is determined by the vote or average of decision trees. The basic principle of SVM is to separate two classes of training data by defining a hyperplane and maximizing the distance between the two classes.
The Stacking algorithm is a widely used ensemble learning method that obtains predictive models with higher accuracy and better generalization ability by combining basic classifier models. It was initially proposed by Wolpert (51). Its basic idea is to obtain an optimal integrated classifier model by training and combining multiple basic classifier models. In the Stacking algorithm, machine learning algorithms with strong learning and fitting capabilities are frequently used to construct the basic classifier models so that the training data are adequately learned and interpreted. To reduce the degree of overfitting, simple algorithms with strong interpretability are commonly used to construct the integrated classifier model.
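A minimal sketch of such a two-layer model using scikit-learn's `StackingClassifier` is given below, with KNN, LR, RF, and SVM as the first layer and logistic regression as the second layer. The function name `build_stacking_model` is hypothetical, all hyperparameters are library defaults rather than the tuned values from this study, and the synthetic data produced by `make_classification` merely stand in for the HA feature matrix.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC


def build_stacking_model(seed=0):
    """Four base classifiers combined by a logistic regression."""
    base_models = [
        ("knn", KNeighborsClassifier()),
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(random_state=seed)),
        ("svm", SVC(random_state=seed)),
    ]
    # cv=5: the second layer is trained on out-of-fold predictions of
    # the base models, which limits leakage from the training labels.
    return StackingClassifier(estimators=base_models,
                              final_estimator=LogisticRegression(),
                              cv=5)


# Synthetic stand-in data, split 4:1 as in the benchmark dataset.
X, y = make_classification(n_samples=150, n_features=12, n_informative=6,
                           class_sep=2.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

model = build_stacking_model().fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
```

In practice each base model would first be tuned by grid search, and the tuned estimators would then be passed to the stacking layer.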

Performance evaluation
To assess the effectiveness of the constructed models, we employed 5-fold cross-validation and independent testing. The performance of the proposed model was evaluated using several metrics, including accuracy (ACC), sensitivity (Sn), specificity (Sp), Matthews correlation coefficient (MCC), and the area under the receiver operating characteristic curve (AUC) (27, 52–56). ACC, Sn, Sp, and MCC were expressed as:

$$ACC = \frac{TP + TN}{TP + TN + FP + FN}$$

$$Sn = \frac{TP}{TP + FN}$$

$$Sp = \frac{TN}{TN + FP}$$

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FN)(TP + FP)(TN + FP)(TN + FN)}}$$

where TP, TN, FP, and FN represent, respectively, correctly identified positive samples, correctly identified negative samples, negative samples incorrectly identified as positive, and positive samples incorrectly identified as negative.
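The four count-based metrics can be computed directly from a confusion matrix, as in the sketch below. The function name `classification_metrics` is hypothetical, and the counts in the usage line are made-up numbers for illustration, not results from this study.

```python
from math import sqrt


def classification_metrics(tp, tn, fp, fn):
    """ACC, Sn, Sp and MCC from confusion-matrix counts.

    Assumes at least one positive and one negative sample, so the
    Sn and Sp denominators are nonzero.
    """
    acc = (tp + tn) / (tp + tn + fp + fn)
    sn = tp / (tp + fn)   # sensitivity: recall on the positive class
    sp = tn / (tn + fp)   # specificity: recall on the negative class
    denom = sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    # MCC is conventionally reported as 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "Sn": sn, "Sp": sp, "MCC": mcc}


# Illustrative counts only.
example = classification_metrics(tp=45, tn=40, fp=10, fn=5)
```

With these counts, ACC = 85/100 = 0.85, Sn = 45/50 = 0.90, and Sp = 40/50 = 0.80. Unlike ACC, MCC stays informative when the two classes are imbalanced, although the benchmark dataset here is balanced.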
Additionally, we used the receiver operating characteristic (ROC) curve to evaluate model performance. The closer the AUC is to 1, the better the model performance.

Optimal feature subset
We constructed optimal feature subsets using ANOVA and IFS and evaluated the model for each subset using the ACC. Figure 2A shows the IFS curve for the fused feature set. When the feature set contained 773 features, the prediction model achieved a maximum ACC value of 0.9585.
The optimal feature subset includes 13-dimensional AAC, 73-dimensional DPC, 629-dimensional TPC, 29-dimensional CKSAAP, and 29-dimensional DCP features. Notably, PseAAC is not included in this subset, suggesting that it is less effective in classifying HA than the other features. Furthermore, TPC accounts for the largest proportion of the optimal feature subset, indicating that TPC provides the best identification and differentiation ability among the six feature extraction methods.
To demonstrate the impact of the optimal feature subset on model performance, we compared the performance of SVM prediction models constructed with the optimal feature subset to those constructed with the six single feature sets. Each model was optimized using grid search within the same parameter range, and all models were evaluated using 5-fold cross-validation. Table 1 presents the results of the comparison, and Figure 2B shows the ROC curves for the 5-fold cross-validation of these models. The model constructed with the optimal feature subset achieved an ACC of 94.06% and an AUC of 0.970, outperforming the models constructed with the single feature sets. These results indicate that the optimal feature subset significantly improved the model's prediction performance.

Model construction and evaluation
We constructed four basic models and an integrated model based on the optimal feature subset. The optimal parameters for each algorithm were as follows: K = 52 for KNN, n = 62 for RF, f = 6 for the number of features considered during best-split search, ξ = 4 for the SVM kernel parameter, and C = 32 for the regularization parameter.
Table 2 presents the performance comparison of different classifier models using two testing methods. Figures 2C,D show the ROC curves of the constructed integrated model under these two testing methods. With 5-fold cross-validation, the proposed integrated model achieved an ACC of 95.85% and an AUC of 0.9863. On the independent test set, the integrated model achieved an ACC of 93.18% and an AUC of 0.9793. These results demonstrate that the proposed integrated model exhibited better HA prediction capability, improved model performance, and enhanced generalization ability compared to a single model.

Comparison with other machine learning algorithms
We created two models based on the optimal feature subset and compared their performance to demonstrate the superiority of our proposed model. The comparison results are presented in Table 3, where the model constructed with the XGBoost algorithm is compared with our proposed model. The main parameters of the XGBoost model are as follows: max_depth = 3, learning_rate = 0.16, colsample_bytree = 0.85, subsample = 0.75. The results in Table 3 show that our model has good classification performance.

Leave-one-out validation
Because the sample size is small, the robustness of the model may be questioned. To ensure credible results, we used the leave-one-out method to re-validate model performance. The results of the model performance evaluation based on the leave-one-out method are shown in Table 4. In this evaluation, the model achieves an ACC of 93.45%.

Conclusion
Hemagglutinin (HA) is a vital glycoprotein found on the surface of influenza viruses, and accurately identifying HA is crucial for the development of targeted vaccines and drugs. In this study, we proposed a prediction model based on HA protein sequence features. The model was constructed using the Stacking algorithm, incorporating an optimal subset of features and basic classifier models. Our results demonstrated that the constructed model exhibits excellent predictive capacity and generalization ability.
We anticipate that the model will prove valuable in the effective identification and prediction of HA. Moving forward, we plan to explore additional feature extraction methods and optimize our prediction model to further enhance its performance. Additionally, we are committed to developing an accessible web server to facilitate the identification and prediction of HA.
In summary, our research provides a promising approach to accurately identifying HA and lays the foundation for the development of targeted vaccines and drugs. We believe that our findings contribute to the advancement of influenza research and offer valuable insights for future studies in this field.

FIGURE 2
Performance analysis for optimal feature subsets and HA prediction models. (A) IFS curve for fusion features. (B) ROC curves of models constructed based on optimal feature subsets and six single feature sets. (C) ROC curves of the integrated classifier model with 5-fold cross-validation. (D) ROC curves of the integrated classifier model with independent testing.

TABLE 1
Performance of models constructed based on optimal feature subsets and six single feature sets.

TABLE 2
Performance of the integrated classifier model and the four basic classifier models.

TABLE 3
Performance of the stacking classifier model and the XGBoost classifier model.

TABLE 4
Performance evaluation based on the leave-one-out method.